Economic evaluation of interventions for treatment-resistant depression: A systematic review

Background The extraordinarily high prevalence of treatment-resistant depression (TRD), coupled with its high economic burden to both healthcare systems and society, underscores how critical it is that resources are managed optimally to address the significant challenge it presents. Objective To review the literature on economic evaluation in TRD systematically, with the aim of informing future studies by identifying key challenges specific to the area, and highlighting good practices. Methods A systematic literature search across seven electronic databases was conducted to identify both within-trial and model-based economic evaluations in TRD. Quality of reporting and study design was assessed using the Consensus Health Economic Criteria (CHEC). A narrative synthesis was conducted. Results We identified 31 evaluations, including 11 conducted alongside a clinical trial and 20 model-based evaluations. There was considerable heterogeneity in the definition of treatment-resistant depression, although more recent studies tended to use a definition of inadequate response to two or more antidepressive treatments. A broad range of interventions was considered, including non-pharmacological neuromodulation, pharmacological, psychological, and service-level interventions. Study quality as assessed by CHEC was generally high. Items that were frequently poorly reported related to discussion of ethical and distributional issues, and model validation. Most evaluations considered comparable core clinical outcomes, encompassing remission, response, and relapse. There was good agreement on the definitions and thresholds for these outcomes, and a relatively small pool of outcome measures was used. Resource criteria used to inform the estimation of direct costs were reasonably uniform.
Predominantly, however, there was a high level of heterogeneity in terms of evaluation design and sophistication, quality of evidence used (particularly health state utility data), time horizon, population considered, and cost perspective. Conclusion Economic evidence for interventions in TRD is underdeveloped, particularly so for service-level interventions. Where evidence does exist, it is hampered by inconsistency in study design, methodological quality, and availability of high quality long-term outcomes evidence. This review identifies a number of key considerations and challenges for the design of future economic evaluations. Recommendations for research and suggestions for good practice are made. Systematic review registration https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=259848&VersionID=1542096, identifier CRD42021259848.


Introduction
Major depressive disorder (MDD) affects approximately 5% of the global population and continues to be a major contributor to the overall global burden of disease (1). There is strong evidence that the prevalence of MDD is increasing (2), with the COVID-19 pandemic driving prevalence rates yet higher. The response to the global health crisis, and the strategies used to prevent the spread of the virus, created an environment in which factors contributing to MDD onset and recurrence were exacerbated, contributing to a 28% rise in global prevalence rates (3). Since many of these factors persist (including, but not restricted to, constrained healthcare resources, widened socioeconomic inequality, social isolation, and neuropsychiatric sequelae), this trend is not expected to reverse in the near term (4,5).
Response to treatment of MDD varies, with many patients requiring more than one treatment step (6). A third of patients do not report improved symptoms despite multiple interventions, resulting in a persistent form of depression commonly described as "treatment-resistant depression" (TRD) (7). Defining TRD is problematic, since failure to respond to treatment "exists on a continuum" (8). A recent review found that while the most widely used definition for TRD was a failure to respond to two or more treatments at an adequate dose and duration, only 19% of recent interventional TRD studies were consistent with that definition (9).
Reflecting this heterogeneity in the classification of TRD (10), and indeed in the patient population (11), no single treatment pathway exists, although a stepped-care approach is recommended. Such a model aims to make the best use of scarce treatment resources by ensuring that the most effective, least restrictive treatments (in terms of both healthcare resources and patient convenience) are delivered first, with patients "stepped up" to more intensive treatments as needed (12). Recent UK National Institute for Health and Care Excellence (NICE) guidelines (13) advocate starting treatment for moderate to severe MDD with psychological interventions, such as cognitive-behavioural therapy (CBT), combined with an antidepressant. Where symptoms persist after 4-6 weeks, additional treatments and referral to secondary/specialist mental health services should be considered. Further treatments may include increasing the antidepressant dose, switching to another antidepressant medication of the same or different class, switching to another psychological therapy, adding a second-generation antipsychotic or lithium, or augmenting with electroconvulsive therapy (ECT), lamotrigine, or triiodothyronine. Other treatment options include repetitive transcranial magnetic stimulation (rTMS) and implanted vagus nerve stimulation.
Despite this diverse armamentarium, there remains a high unmet need for new and cost-effective interventions (14,15). Unfortunately, the condition is highly recurrent: 80% of TRD patients experience relapse within a year of remission, and the probability of sustained remission over 10 years is just 40% (16). A well-established body of evidence has demonstrated that increasing treatment resistance is associated with poorer health-related quality of life (HRQoL) (8), increased direct medical costs (8,17,18), and indirect costs to society attributed to impairments in work productivity and activity (15,19), and social care demands (20).
Against a background of increasingly constrained healthcare budgets, it is important that decision makers consider not only clinical effectiveness, but the economic evidence for interventions, in order to identify and prioritize those that make the best use of available resources (21). Previous systematic reviews of economic evaluations of interventions for MDD have reported considerable uncertainty in their findings due to inconsistent methodological quality and results (22), and highlighted a lack of evidence and good quality data in TRD (23,24). Johnston et al. (8) reviewed the literature on the economic burden of TRD, and found significant methodological and population disparities, highlighting heterogeneity in defining TRD, the outcomes measured, and the health state utility values reported.
The aim of this review is to appraise the existing evidence and methods used in economic evaluations of interventions for TRD, and to make best-practice recommendations to inform the development of future evaluations. Promoting consistency in evaluation methodology will improve confidence when making resource allocation decisions, and increase the likelihood that promising interventions receive appropriate funding or support.

Concepts in health economic evaluation

Type of economic evaluation
A "full" health economic evaluation compares both the costs and the consequences of alternative courses of action (25). The output of the evaluation is (typically) an incremental cost-effectiveness ratio (ICER) (26). Depending on the outcome measure used, economic evaluations may be classified as: cost-effectiveness analyses (CEA), when a clinical outcome measure is used; cost-benefit analyses (CBA), when outcomes are valued in monetary terms; cost-utility analyses (CUA), when health outcomes are valued as health state utilities to derive quality-adjusted life years; cost-consequence analyses (CCA), where multiple outcomes not easily summarized in a single measure are presented in a disaggregated format; and cost-minimisation analyses (CMA), which assume that the outcomes from the alternatives under consideration are equivalent (27).
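As an illustration of the ICER described above, the following minimal sketch divides the difference in mean cost between two alternatives by the difference in mean effect. The function name and all numerical inputs are hypothetical and are not taken from any of the reviewed evaluations.

```python
def icer(cost_new, effect_new, cost_old, effect_old):
    """Return the incremental cost per additional unit of effect."""
    delta_cost = cost_new - cost_old
    delta_effect = effect_new - effect_old
    if delta_effect == 0:
        # The ratio has no meaning when both alternatives are equally effective.
        raise ValueError("ICER is undefined when the incremental effect is zero")
    return delta_cost / delta_effect


# Hypothetical inputs: mean cost and mean QALYs per patient in each arm.
example_icer = icer(cost_new=12000.0, effect_new=0.75,
                    cost_old=9000.0, effect_old=0.50)
print(f"ICER: {example_icer:.0f} per QALY gained")
```

In a CUA the effect would be measured in QALYs; in a CEA it would be a clinical outcome such as remitters or depression-free days.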

Health state utilities
Health state utilities are used to represent the "value" of different health states, based on a surveyed population's strength of preference for those health states. Utilities are conventionally scaled between 0 and 1, with 1 representing the value of perfect health and 0 representing the valuation of death (28). Some systems allow a negative utility value, whereby very poor health states may be valued as less preferable than death. When measured over time, utilities may be used to derive the quality-adjusted life years (QALYs) associated with living in a particular health state (29).
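To make the link between utilities and QALYs concrete, the sketch below computes QALYs as the area under a utility-over-time curve. Linear interpolation between measurement points (the trapezoidal rule) is an assumption made here for illustration, and the measurement times and utility values are hypothetical.

```python
def qalys_from_utilities(times_years, utilities):
    """Area under the utility curve, assuming linear change between visits."""
    if len(times_years) != len(utilities) or len(times_years) < 2:
        raise ValueError("need at least two matched time/utility observations")
    total = 0.0
    for i in range(1, len(times_years)):
        interval = times_years[i] - times_years[i - 1]
        mean_utility = (utilities[i] + utilities[i - 1]) / 2
        total += interval * mean_utility
    return total


# Hypothetical patient: utility measured at baseline, 6 months, and 12 months.
example_qalys = qalys_from_utilities([0.0, 0.5, 1.0], [0.40, 0.60, 0.70])
print(f"QALYs accrued over one year: {example_qalys:.3f}")
```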

Perspective
The perspective of the evaluation refers to the breadth of costs and benefits that are to be considered in the evaluation. Most commonly, the perspective of the healthcare provider or payer is adopted; at the broadest, a "societal" perspective reflects a comprehensive range of social opportunity costs associated with the alternatives under consideration (30). Where significant opportunity costs exist outside the healthcare system, for example in public health interventions, a broad perspective is advised, and there is growing support for such a broad perspective to be used in mental health economic evaluation (21). The 2016 Second Panel on Cost Effectiveness in Health and Medicine recommends analysts adopt a comprehensive approach, reporting separately both healthcare sector and societal perspectives (31). The Panel further recommends the societal perspective report costs and consequences in a comprehensive "impact inventory," and where possible, that non-health consequences are quantified and valued (31). While methodological guidance on choice of perspective varies by jurisdiction, it is generally agreed that the choice should be explicitly stated and determined by the study sponsor (and any stakeholders identified by the sponsor) (32).

Time horizon
The time horizon refers to the period over which the costs and benefits of the evaluation are captured. Choice of time horizon is influenced by the nature of the condition and intervention under evaluation, and the framework and purpose of the analysis. Ideally, the time horizon for economic evaluations should be sufficiently long to capture relevant differences in costs and outcomes between the comparators; for many interventions, this requires a lifetime horizon (33,34). Where extrapolated data are used, this is likely to require the analyst to make assumptions about the continued efficacy of the interventions (35).

Study design
Economic evaluations of health care interventions typically follow one of two study designs: "within-trial" evaluations, where the costs and benefits of alternative courses of action are collected alongside clinical data in interventional clinical studies; and those that use decision analytic models.

Within trial designs
Within-trial evaluations have the advantage that the costs and consequences of the interventions under investigation are measured directly, but are constrained by the follow-up period, frequently precluding assessment of long-term cost effectiveness (36). Extrapolation may be possible using survival analysis models, though this approach requires related long-term data on costs, benefits and complications of the interventions (37).
Sample size and power estimates for trials are most commonly based on the primary clinical outcome. Owing to the tendency of cost variables to have much greater variance than clinical outcomes, trial-based economic evaluations are often underpowered to detect statistically significant differences in cost (38). Accordingly, health economic evaluations assess the probability of cost effectiveness against a certain threshold of willingness-to-pay (WTP), rather than employing statistical hypothesis tests concerning cost effectiveness (37). Typically, probability of cost-effectiveness is assessed against a range of WTP values, and is represented in a cost-effectiveness acceptability curve, derived from the joint distribution of incremental costs and effects (37,39). Most commonly, this distribution is estimated using non-parametric bootstrapping to address sampling uncertainty (39).
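The bootstrapping approach described above can be sketched as follows. Patients are resampled with replacement within each arm, keeping each patient's cost and effect together so that the joint distribution is preserved, and the probability of cost-effectiveness at each WTP value is taken as the share of replicates with a positive incremental net monetary benefit. The function name, the use of net monetary benefit as the decision rule, and the tiny patient-level data set are all illustrative assumptions.

```python
import random


def bootstrap_ceac(costs_ctrl, effects_ctrl, costs_int, effects_int,
                   wtp_values, n_boot=1000, seed=0):
    """Return {WTP: probability that the intervention is cost-effective}."""
    rng = random.Random(seed)

    def resampled_means(costs, effects):
        # Resample patients (not costs and effects separately) with replacement.
        idx = [rng.randrange(len(costs)) for _ in range(len(costs))]
        n = len(idx)
        return (sum(costs[i] for i in idx) / n,
                sum(effects[i] for i in idx) / n)

    replicates = []
    for _ in range(n_boot):
        c_ctrl, e_ctrl = resampled_means(costs_ctrl, effects_ctrl)
        c_int, e_int = resampled_means(costs_int, effects_int)
        replicates.append((c_int - c_ctrl, e_int - e_ctrl))

    curve = {}
    for wtp in wtp_values:
        # Incremental net monetary benefit = WTP * delta_effect - delta_cost.
        favourable = sum(1 for d_cost, d_eff in replicates
                         if wtp * d_eff - d_cost > 0)
        curve[wtp] = favourable / n_boot
    return curve


# Hypothetical per-patient costs and QALYs for a control and intervention arm.
ctrl_costs, ctrl_qalys = [900, 1100, 1000, 950], [0.50, 0.55, 0.52, 0.48]
int_costs, int_qalys = [1500, 1700, 1600, 1650], [0.60, 0.66, 0.62, 0.64]
wtp_grid = [0, 5000, 10000, 20000, 1000000]
curve = bootstrap_ceac(ctrl_costs, ctrl_qalys, int_costs, int_qalys, wtp_grid)
for wtp in wtp_grid:
    print(wtp, curve[wtp])
```

Plotting the returned probabilities against the WTP grid gives the acceptability curve. A real analysis would use far larger samples and would usually need to handle missing data first.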
Best practice guidelines encourage the use of robust methods to address missing data, since exclusion of cases with missing or censored data may introduce bias (33). While several approaches may be adopted for handling missing data (including complete case analysis, single imputation, and inverse probability weighting), the use of multiple imputation models is usually recommended (40), although this approach may be contested when evaluating data with a high degree of missingness (41).
Combining methods for addressing sampling uncertainty with those for addressing missing data, however, is non-trivial and presents both practical challenges (e.g., computational intensity) and statistical challenges (e.g., the artificial reduction of sampling uncertainty through imputation) (33,42). There is a need for further research in this area, as no consensus currently exists on best practice approaches (41).

Decision analytic model designs
Decision analytic models may be used to extrapolate the findings of a clinical trial over a longer time horizon, or to a different population, or may be used to compare interventions for which no head-to-head trials have yet been conducted. Economic models are mathematical abstractions of the real world: analysts will work with subject-matter experts to conceptualize a specific structure, the contingent assumptions, and required input parameters (43). The models describe the probability of specific outcomes following an intervention, with the costs and benefits of each outcome having an associated value. The expected value of that intervention is expressed as the sum of values for each outcome, weighted by the probability of the outcome (43).
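The probability-weighted sum described above can be written out directly. The sketch below assumes a single decision node with three mutually exclusive outcomes (remission, response without remission, and non-response); the probabilities, costs, and QALYs are hypothetical placeholders.

```python
def expected_values(branches):
    """branches: list of (probability, cost, qalys) for mutually exclusive outcomes."""
    total_probability = sum(p for p, _, _ in branches)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("branch probabilities must sum to 1")
    expected_cost = sum(p * cost for p, cost, _ in branches)
    expected_qalys = sum(p * qalys for p, _, qalys in branches)
    return expected_cost, expected_qalys


# Hypothetical six-month outcomes for one treatment strategy.
strategy = [
    (0.3, 2000.0, 0.40),  # remission
    (0.2, 3000.0, 0.33),  # response without remission
    (0.5, 4000.0, 0.25),  # non-response
]
tree_cost, tree_qalys = expected_values(strategy)
print(f"Expected cost: {tree_cost:.0f}; expected QALYs: {tree_qalys:.3f}")
```

Repeating the calculation for each comparator gives the expected costs and effects from which incremental results are derived.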
Three approaches are commonly used in decision analytic economic evaluation models. The decision tree is a simple but widely used approach for evaluating short-term prognoses, represented by a series of pathways (44). Markov cohort models may be used to evaluate outcomes over a lifetime horizon, and typically model a homogeneous population transitioning through a series of "health states." Transitions are modeled in a series of cycles (of a length defined by the analyst); a key property (and frequently a problematic assumption) of Markov models is that no "memory" of the events of previous cycles is retained through each transition (45). Individual-level microsimulation models, which may take the same form as a Markov model, facilitate modeling of a heterogeneous population, and the impact of past events (e.g., number of treatment failures, or adverse events) on prognosis (46).
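A Markov cohort model of the kind described above can be sketched in a few lines. The three health states, the monthly transition probabilities, and the per-cycle costs and utilities below are hypothetical, and the sketch omits discounting and half-cycle correction for brevity. Because the next distribution depends only on the current one, the model has no memory of earlier cycles.

```python
def run_markov_cohort(start, transitions, cycle_costs, utilities,
                      n_cycles, cycle_years):
    """Propagate a cohort through health states, accumulating cost and QALYs."""
    states = list(start)
    for state in states:
        if abs(sum(transitions[state].values()) - 1.0) > 1e-9:
            raise ValueError(f"transition probabilities from {state} must sum to 1")
    distribution = dict(start)
    total_cost = 0.0
    total_qalys = 0.0
    for _ in range(n_cycles):
        # Accrue cost and QALYs for the time spent in each state this cycle.
        total_cost += sum(distribution[s] * cycle_costs[s] for s in states)
        total_qalys += sum(distribution[s] * utilities[s] for s in states) * cycle_years
        # Memoryless update: depends only on the current state distribution.
        distribution = {
            target: sum(distribution[s] * transitions[s][target] for s in states)
            for target in states
        }
    return total_cost, total_qalys, distribution


# Hypothetical inputs: the whole cohort starts in non-response.
start = {"remission": 0.0, "response": 0.0, "non_response": 1.0}
transitions = {
    "remission": {"remission": 0.85, "response": 0.05, "non_response": 0.10},
    "response": {"remission": 0.20, "response": 0.60, "non_response": 0.20},
    "non_response": {"remission": 0.10, "response": 0.15, "non_response": 0.75},
}
cycle_costs = {"remission": 100.0, "response": 250.0, "non_response": 600.0}
utilities = {"remission": 0.85, "response": 0.70, "non_response": 0.50}

markov_cost, markov_qalys, final_distribution = run_markov_cohort(
    start, transitions, cycle_costs, utilities, n_cycles=12, cycle_years=1 / 12)
print(f"Cost: {markov_cost:.0f}; QALYs: {markov_qalys:.3f}")
```

A microsimulation would instead track individual patients, which allows transition probabilities to depend on history such as the number of previous treatment failures.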
Analogous to bootstrapping in within-trial evaluations, parametric methods (e.g., Monte Carlo simulation) are recommended to generate sampling distributions of joint mean cost and efficacy estimates (47).
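A minimal sketch of such a Monte Carlo analysis is given below. Each draw samples the uncertain inputs from parametric distributions and records the resulting incremental cost and incremental QALYs. The choice of beta distributions for probabilities and gamma distributions for costs is a common convention assumed here, and every distribution parameter, the fixed QALY gain per remission, and the WTP threshold are hypothetical.

```python
import random


def probabilistic_analysis(n_draws=2000, wtp=30000.0, seed=1):
    """Return sampled (delta_cost, delta_qalys) pairs and P(cost-effective)."""
    rng = random.Random(seed)
    qaly_gain_per_remission = 0.2  # hypothetical, held fixed for simplicity
    draws = []
    for _ in range(n_draws):
        p_remit_new = rng.betavariate(40, 60)   # mean 0.40
        p_remit_old = rng.betavariate(30, 70)   # mean 0.30
        cost_new = rng.gammavariate(100, 50.0)  # shape * scale = mean 5000
        cost_old = rng.gammavariate(100, 30.0)  # shape * scale = mean 3000
        delta_cost = cost_new - cost_old
        delta_qalys = (p_remit_new - p_remit_old) * qaly_gain_per_remission
        draws.append((delta_cost, delta_qalys))
    favourable = sum(1 for d_cost, d_qalys in draws
                     if wtp * d_qalys - d_cost > 0)
    return draws, favourable / n_draws


psa_draws, prob_cost_effective = probabilistic_analysis()
print(f"Probability cost-effective at the chosen WTP: {prob_cost_effective:.2f}")
```

The sampled pairs can be plotted on the cost-effectiveness plane, and repeating the final step over a range of WTP values yields an acceptability curve.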

Methods
This systematic review follows guidance provided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) group (48). The study protocol was registered with the International Prospective Register of Systematic Reviews (PROSPERO; registration CRD42021259848).

Eligibility criteria
Predefined inclusion criteria, defined by the Population, Intervention, Comparator, Outcome, Study type (PICOS) framework (Table 1), were used to determine study selection. Evaluations were included if the author defined the population as "persistent/treatment-resistant/treatment-refractory depression," within adult populations (i.e., individuals aged at least 18 years). Any intervention, across all treatment settings (primary, secondary, and/or community care), relating to the treatment or management of TRD was eligible. Evaluations were excluded if there was no comparator, where comparators could include placebo, an alternative to standard treatment, or treatment as usual.
Evaluation types included any "full" economic evaluation that considered incremental changes across both costs and consequences (CUA, CEA, CBA, CMA, and CCA).
Included evaluations were required to be full-length, peer-reviewed interventional, observational, or modeling reports in journal or Health Technology Authority (HTA) publications in the English language. No date restrictions were imposed. Additionally,

Study selection
Records identified in the search strategy were uploaded to the Rayyan platform for de-duplication and screening. All papers were examined against the PICOS inclusion and exclusion criteria independently by two reviewers (RC and LH) in a two-stage process: title and abstract screening, followed by full-text screening. Reviewers discussed conflicts after each phase and a consensus was reached.

Data extraction and quality assessment
Key study information was extracted using a pre-defined spreadsheet in Microsoft Excel. Two reviewers (RC and LH) conducted data extraction with a 30% overlap in evaluations. The level of agreement between the overlapping extractions was compared and discussed. Disagreements regarding the content of the extraction fields were resolved through discussion. Data extraction fields included: evaluation details (publication type, setting, objectives); population; general evaluation characteristics (type of intervention and controls, perspective, type of evaluation used, study design, time horizon and reference year); resource use and costs (type of category and costs, data source, and methods used to calculate costs); outcomes (primary clinical outcomes, other clinical outcomes, economic outcomes, and data source for outcomes); economic evaluation results (incremental costs and effects, summary measure of benefits, cost effectiveness results, analyses of uncertainty, and author's conclusions); and model-based evaluation characteristics (model type, model structure and assumptions, rationale for model type and structure, consideration of population heterogeneity).
The Consensus on Health Economic Criteria (CHEC) was used for quality-of-reporting assessment (49). The 19-item CHEC is recommended for systematic reviews that incorporate both trial-based and model-based economic evaluations (50). Additional items related to model conceptualization were included in the assessment: rationale for model type; rationale for model structure; and whether sufficient information was provided to reproduce the model. These items have not been validated, but were informed by items within the "Phillips checklist" for decision analytic models (51).

Analysis
Evaluation characteristics, design, key cost and outcome parameters, and results were synthesized in summary tables and a narrative synthesis approach was used to describe common features and key differences amongst identified economic evaluations.

Search results and evaluation selection
The evaluation selection process is summarized in Figure 1. A total of 539 records were identified through the literature searches, and one more was found through screening reference lists (52). After removing 85 duplicates, 400 records clearly failed to meet the inclusion criteria, or met at least one exclusion criterion, leaving 52 for full-text screening. Of these, 31 satisfied the inclusion criteria and were selected for review (52-82). The 31 evaluations relate to 29 unique studies, with multiple economic evaluations included for two studies: a trial comparing the cost effectiveness of rTMS and ECT (58,69), and a trial of CBT as an adjunct to pharmacotherapy (73,82). Of the 31 evaluations included, 11 were trial-based (predominantly psychological [n = 6] or service-level [n = 2] interventions), and 20 were model-based (predominantly non-pharmacological neuromodulation [n = 12] or pharmacological [n = 8] interventions). Twenty-four of the evaluations adopted a cost-utility analysis (CUA) as their primary analytical approach, six used cost-effectiveness analysis (CEA), and one adopted a cost-consequence analysis (CCA) as the primary method of evaluation. Six evaluations used multiple analytical approaches. The median time horizon was 1 year; eight evaluations used a time horizon of less than a year, and only two evaluations considered a lifetime horizon. The primary analysis for most evaluations (n = 28) considered costs from a healthcare provider perspective, three evaluations considered a (partial) societal perspective, and seven presented both societal and healthcare provider perspectives. The evaluations came almost exclusively from high income countries (

Quality of reporting assessment
Quality of reporting of the evaluations was predominantly high; the range of fulfilled CHEC criteria across the evaluations fell between 47 and 100%, with an average of 83% of criteria fulfilled. Five evaluations met all criteria from the CHEC-list, and only two evaluations fulfilled fewer than 60% of the criteria (57,61). The lowest-scoring items from the checklist were: discussion of ethical and distributional issues (45% of evaluations); reporting of structural assumptions and validation methods of models (55% of relevant evaluations); consideration of the generalizability of the results (61% of evaluations). Additional items used to evaluate reporting of conceptualization of model-based evaluations were less well reported: only 15% provided a rationale for choice of model type, and 55% provided a rationale for the model structure. Results of the quality assessment are presented in Supplementary Table 2.

TRD population
There was variation in the patient populations considered by the included evaluations, reflecting a lack of consensus on the definition of TRD (83). Most commonly, treatment resistance was defined as a failure to achieve an adequate response to antidepressive treatment (n = 24), with half of these specifying a requirement for failure of at least two lines of therapy. Three evaluations used a definition based on the number of previous episodes, or duration of the current episode, and four evaluations did not clearly define treatment-resistant depression or the studied population. At baseline, the populations considered were typically severely depressed; however, severity was not well defined in most model-based studies and had to be inferred from the utility values reported.

Clinical outcomes
Trial-based evaluations tended to use either response (n = 4) or change in depressive symptoms (n = 5) as their primary clinical outcome, with only one evaluation using remission (in addition to change in depressive symptoms). Other outcomes included relapse (n = 2), and depression-free days. Model-based evaluations tended to include both response and remission (n = 13), with five evaluations modeling remission only, one evaluation modeling response, and one modeling change in depressive symptoms.
Figure 1. PRISMA flow diagram of study identification, adapted from (48).
Frontiers in Psychiatry, frontiersin.org
In trial-based evaluations, the most common outcome measure was the Hamilton Depression Rating Scale (HAM-D, n = 6), followed by the Beck Depression Inventory II (BDI-II, n = 3). Other measures included the Montgomery-Åsberg Depression Rating Scale (MADRS, n = 1), Symptom Checklist-90 (SCL-90, n = 1), Beck Depression Inventory (BDI, n = 1), and the Global Assessment of Functioning scale (GAF, n = 1). Model-based evaluations typically synthesized outcomes from multiple sources, where outcomes may have been measured using several scales, though most frequently mentioned scales included the MADRS (n = 10) and the HAM-D (n = 9).
Response was typically defined as an improvement of ≥50% from baseline against the scales used; however, there was some variation in the scores used to define remission (
Of the nine evaluations that used a CEA approach (including four as secondary analyses), the most common economic outcome measures were cost per unit change in depression scale rating (n = 4), and cost per remitter (n = 3). Alternative outcomes included cost per relapse prevented (78), and cost per depression-free day (80).
The single CCA evaluation used maintenance of response and maintenance of relapse as outcome measures (72). All clinical and health economic outcome measures used are summarized in Supplementary Table 3.

Resource use and cost data
Generally, costs were well reported, although several evaluations only reported costs at an aggregate level (53,57,60,61,64,65). Trial-based evaluations primarily used self-report questionnaires to collect resource use data (n = 8), but also relied on registry or hospital chart data (n = 4), and claims databases (n = 2). Most model-based evaluations drew data from the literature (n = 10), claims databases (n = 6), or registry or hospital chart data (n = 5).
Direct costs reported for all evaluations included treatment costs, with most also including outpatient (n = 27) and inpatient costs (n = 26); only three evaluations explicitly included costs for adverse events (AEs). The level of detail reported on assumptions and methods for attributing capital equipment costs for neuromodulation interventions varied considerably. Indirect costs were considered by the ten evaluations that considered a broader cost perspective, but the scope of items collected varied considerably. Most (n = 9) considered productivity (in most cases measuring only absenteeism, although one also measured presenteeism) (76); others additionally considered out-of-pocket payments (n = 4), informal care (n = 4), formal societal or community care (n = 3), or transport (n = 3), but no two evaluations included the same set of indirect cost measures.

Modeling approaches and scope
The details of the 20 models appraised are given in Tables 3A, B. Six evaluations used a decision tree approach, the majority of which (n = 5) were evaluations of non-pharmacological neuromodulation interventions, while the sixth compared novel selective serotonin reuptake inhibitors (SSRIs)/serotonin and norepinephrine reuptake inhibitors (SNRIs) and generic SSRIs (61). In keeping with the associated restrictions of this analytical approach, all used a short time horizon, typically 6 months or less. The decision trees largely followed a similar structure, modeling three possible outcomes: remission, response (with no remission), and non-response. A representation of the generic decision tree structure is shown in Figure 2.
There were several notable variations from this structure. Both Kozel et al. (60) and Ghiasvand et al. (57) assumed that any response equated to full remission. This is a significant limitation as it does not allow for partial improvements in symptoms, and thereby is likely to overestimate the benefits of interventions.
Kozel's model also allowed for relapse. The model described by Malone et al. (61) compared the costs and consequences of various pharmaceutical treatment regimens, and augmented this generic structure with further steps that considered adverse events (AEs), and treatment changes. While four of the six evaluations described the conceptualization of the model structure, none described the rationale for selecting a decision tree approach, and only half described any structural assumptions or indicated that any validation assessment was undertaken.
Twelve evaluations used Markov cohort models, and three extended this approach with more sophisticated Markov microsimulation models (all for neuromodulation interventions).
A key characteristic of this extension is that it enables the tracking of individual patient characteristics or event history through the model. Most Markov models had a minimum horizon of 12 months, but only two had a lifetime horizon (53,56). A similar "base" generic structure, shown in Figure 3, was used across the majority of models, with three key "health states": remission, response, and relapse (and/or non-response).
Several evaluations extended beyond this base structure, with varying levels of complexity and sophistication. Seven evaluations preceded the Markov model with a decision tree to represent a distinct acute phase of treatment. Other additional health states used (either as Markov health states or transition health states) included: death, particularly for models with a time horizon greater than 1 year (n = 7); treatment change (n = 7); severe depression (n = 6); discontinuation (n = 7); adverse events (n = 4); and hospitalization (n = 2). Only one evaluation used an entirely different structure, modeling health states defined by four different levels of severity of depression (defined by MADRS score) (39).
The reporting of these models was generally good, with a majority describing a rationale for model structure (n = 9) and structural assumptions (n = 8). Nevertheless, some aspects of the health states included or omitted require some important limiting assumptions. Only seven models accounted for discontinuation of treatment, and none of those omitting discontinuation justified the omission. Discontinuation might feasibly be rolled into the "non-response" health state; however, this was not explicitly stated in any evaluation that omitted a "discontinued" health state, and these evaluations may consequently overestimate treatment benefits by failing to account for discontinued patients. Of those evaluations that did include discontinuation, four either did not distinguish between discontinuation related to AEs or lack of efficacy, or assumed discontinuation due to AEs to be embedded in loss of treatment effect (53,55,67,68). These four evaluations therefore considered AEs implicitly, but assumed no continued impact on quality of life beyond that of discontinuation due to lack of efficacy, an assumption that may not hold for severe or long-lasting AEs. AEs were considered explicitly in only five evaluations. Two considered both costs and utility decrements associated with AEs (52, 62), two considered only utility decrement (64), and one considered only costs (71). The majority did not model AEs and in most cases a rationale was not given, although it was suggested in two evaluations that the impact of AEs was expected to be limited, and similar between comparators (55, 59). While this assumption may be true of some comparators, it is an important structural assumption to validate, as omission will bias toward those interventions that have higher rates of AEs.

Utility data
There was considerable heterogeneity in approaches to sourcing utility data for use in cost-utility models: 11 different
Figure 3. Generic Markov model structure for economic evaluations in TRD. Solid lines indicate pathways that are common amongst the majority of models; dashed lines indicate pathways that are included in a subset of models.

Results of evaluations
Table 4 summarizes evaluation results. Consistency of results varied across interventions. Four evaluations comparing neuromodulation (rTMS or ECT) to TAU consistently found the intervention to be cost-effective, and a dominant strategy (both more effective and less costly) in three evaluations. Direct comparisons of ECT and rTMS, however, were less consistent: six favored ECT and four favored rTMS. The source of these variations is not immediately clear; however, those that favored ECT tended to have a shorter (<12 month) time horizon, which may not have been long enough to capture benefits of maintenance treatment with rTMS. Those that adopted a societal perspective tended to favor rTMS, reflecting the higher indirect costs (care, time off work) of ECT. There is no clear indication that study or model design biased results in either direction. Notably, over half of these evaluations did not explicitly define the patient population in terms of severity or number of previously failed treatments. There was variation in the treatment protocol used for rTMS, which is likely to have a considerable impact on costs, as will the extent to which capital costs are attributed across different evaluations.
Only three pharmacotherapy evaluations considered the same comparators (esketamine vs TAU). Two CUAs found that, despite improved outcomes, esketamine was unlikely to be cost effective. The third evaluation, which was industry-sponsored, used a CEA approach, and found esketamine was likely to be cost efficient. In addition to the differences in analytical approach, the two CUAs had much longer time horizons (5 years and lifetime) than the 12-month CEA. It is likely that the consideration of relapse over those longer horizons had a significant impact on cost-effectiveness.
The evaluations of psychological interventions, which were all trial-based, were generally consistent in their findings: two CUAs comparing CBT to TAU and one comparing ISTDP to TAU found that these interventions were likely to be cost effective; two CUAs of RO-DBT found that the intervention was highly unlikely to be cost effective. The key driver of the cost inefficiency of RO-DBT was the cost of intensive treatment.
We reviewed two trial-based evaluations of service-level interventions which are not directly comparable. A US-based collaborative care program was found to be cost effective (80), while an evaluation of a specialist depression service in the UK found limited additional benefits associated with the service and concluded it was unlikely to be cost-effective (77).
All except one evaluation explored uncertainty in parameters and/or results (72). Bootstrapping or similar methods were used to account for sampling uncertainty in almost all (n = 9) trial-based evaluations, while probabilistic sensitivity analyses were used to account for the joint uncertainty of all key parameters in over half (n = 13) of the model-based evaluations. Although most (n = 16) model-based evaluations conducted some degree of one-way sensitivity analysis, fewer than half (n = 9) conducted a comprehensive sensitivity analysis incorporating all important variables. Key drivers of uncertainty included the probability of response and remission, utility values used for acute/severe depression, and cost of intervention (particularly for rTMS, where the number of treatment courses varied).
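As an illustration of how bootstrapping can be used to characterize sampling uncertainty in a trial-based evaluation, the sketch below resamples patient-level (cost, QALY) pairs and estimates the probability that one arm has the higher net monetary benefit at a given willingness-to-pay threshold. It is a generic nonparametric bootstrap, not the method of any specific reviewed study; the function name, the simulated trial data, and the threshold of 30,000 per QALY are hypothetical.

```python
import random


def bootstrap_prob_cost_effective(arm_a, arm_b, threshold, n_boot=1000, seed=0):
    """Estimate the probability that arm A has a higher mean net monetary
    benefit than arm B. Each arm is a list of (cost, qaly) tuples; patients
    are resampled with replacement so that the correlation between a
    patient's cost and QALYs is preserved."""
    rng = random.Random(seed)

    def mean_nmb(sample):
        return sum(threshold * q - c for c, q in sample) / len(sample)

    wins = 0
    for _ in range(n_boot):
        resampled_a = [rng.choice(arm_a) for _ in arm_a]
        resampled_b = [rng.choice(arm_b) for _ in arm_b]
        if mean_nmb(resampled_a) > mean_nmb(resampled_b):
            wins += 1
    return wins / n_boot


# Hypothetical patient-level data, simulated purely for demonstration.
data_rng = random.Random(1)
intervention = [(data_rng.gauss(5000, 1500), data_rng.gauss(0.70, 0.15)) for _ in range(100)]
usual_care = [(data_rng.gauss(3500, 1500), data_rng.gauss(0.65, 0.15)) for _ in range(100)]

prob = bootstrap_prob_cost_effective(intervention, usual_care, threshold=30000)
print(f"Probability intervention is cost-effective: {prob:.2f}")
```

Repeating the calculation across a range of thresholds would trace out a cost-effectiveness acceptability curve.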

Discussion
The aim of this review was to appraise the literature systematically to describe the methods used in the economic evaluation of interventions for the management of TRD, to inform design and development of future evaluations in this field. We identified 31 evaluations, including 11 trial-based and 20 model-based evaluations. A broad range of interventions and designs were considered by the included evaluations, but almost half evaluated the cost effectiveness of neuromodulation interventions (rTMS and/or ECT), enhancing our ability to consider consistency of evaluation design, and the factors that most strongly influence results.
There was a distinct paucity of evidence relating to the economic evaluation of service-level interventions, with only two studies identified in the literature search. In their evaluation of a dedicated specialist depression service for TRD, Morris et al. (77) noted significant loss to follow-up during the trial and indicated the evaluation may have been underpowered to detect statistical improvements in symptoms at follow-up. It has been argued that the objective of economic evaluation is estimation of expected value of an intervention, and that decision making should therefore be based upon the weight of evidence, rather than the application of statistical inference rules (38,97). Lack of statistical significance may, however, suggest that there is value in obtaining further evidence (97).
Despite a growing interest in the application of digital technologies in the management and delivery of mental health care (98), no economic evaluations of such interventions were identified. Recent studies suggest the implementation of digital technologies (e.g., virtual reality, artificial intelligence) may improve diagnosis, intervention delivery, monitoring, access to care, and potentially reduce costs (98,99). Economic evidence supporting digital technologies in healthcare generally is underdeveloped: there is a clear need for early-stage economic evaluations to support the development of these promising approaches (100).
The quality of reporting as assessed by the CHEC criteria was generally good, and some aspects were found to be relatively consistent across the evaluations. Most evaluations considered comparable clinical outcomes, encompassing remission, response/non-response to treatment, and relapse. There was good agreement on the definitions and associated thresholds for these outcomes, and these were assessed by a relatively small pool of clinical outcome measures. The resource criteria used to inform the estimation of direct costs, including inpatient stays, outpatient appointments, and pharmaceutical costs, were reasonably uniform. Predominantly, however, there was a high level of heterogeneity in terms of evaluation design and sophistication, quality of evidence used (particularly with respect to health state utility data), time horizon, population considered, and cost perspective adopted. The impact of these inconsistencies is highlighted by the fact that, despite the inclusion of 10 evaluations comparing rTMS and ECT, the evidence on the cost effectiveness of rTMS vs ECT remains inconclusive. Our findings are in general agreement with the literature relating to economic evaluation of MDD, where reviews have found the evidence for multiple interventions to be inconclusive due to inconsistencies in evaluation design and methodological quality (21,22), and that the paucity of evidence related to long-term outcomes in TRD restricts our ability to inform the long-term value of interventions in TRD (23,24). In order to inform future economic evaluations in TRD, and promote greater consistency among them, a number of linked methodological considerations are identified and good practices suggested.

Evaluation population and incorporation of patient heterogeneity
There was considerable variation in the definition used to describe the TRD population under study, with a fifth of evaluations providing no explicit definition. The absence of a standardized definition of the population reduces the validity of comparison and data synthesis across evaluations (101). However, one must acknowledge that the population is highly heterogeneous, in terms of both degree of treatment resistance, and medical and psychiatric co-morbid conditions (102). Evaluations that restrict their population to a narrow definition of TRD, or that model a homogeneous cohort, will limit generalizability of the findings. Despite this, very few model-based evaluations in this review explored the impact of patient heterogeneity, and where heterogeneity was considered, only a narrow range of aspects was examined (age, gender, number of previous treatments). Equally, the under-reporting of severity at baseline is problematic when comparing economic evaluations, since this is likely to significantly impact outcomes (103). To improve consistency across economic evaluations, we suggest that the widely used TRD definition of "failure to respond to two or more treatments at an adequate dose and duration" (9) be used as the base case for evaluation. Reflecting the concept that various degrees of resistance exist (102), more sophisticated evaluations might consider staging (for example by number of previous treatments), or at least characterizing the study population in this manner. Good practice guidelines for health economic models already highlight the importance of consideration of heterogeneity (47,104). Cohort models can achieve this through sensitivity testing of results with alternative patient cohorts; more sophisticated patient-level models incorporate the facility to directly model heterogeneity.

Time horizon
The persistent and highly recurrent nature of TRD is not well reflected in many of the evaluations: the time horizon for most models was only 12 months, and the average for trials was 18 months. Only two evaluations used a lifetime horizon, extrapolating outcomes from clinical evaluations with follow-up periods of 12 months or less (53,56). A key driver for the use of models in economic evaluation is to extrapolate the results of clinical trials to a longer-term horizon (47). In the context of TRD, a short time horizon may underestimate the cost effectiveness of an intervention by failing to account for smaller incremental improvements in mental health (accruing substantially with a longer horizon), or the improvements that persist beyond the evaluation horizon; for example, MDD patients receiving cognitive therapy have been found to exhibit reduced relapse rates for up to 6 years (78). Conversely, bearing in mind the highly recurrent nature of TRD over periods of up to 36 months (105,106), cost effectiveness might be overestimated through censoring of relapse or recurrence events. Extrapolation implicitly introduces additional uncertainty into the model, but one must balance the impact of that additional uncertainty on results against the benefits of decision support that reflects the longer-term costs and consequences of the intervention in question.
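The sensitivity of results to the time horizon can be sketched with a simple two-state (depressed/remission) model. Everything in the example is an assumption made for illustration: the monthly remission and relapse probabilities, the utilities, the 3.5% annual discount rate, the one-off incremental cost of 3,000, and, importantly, the assumption that the treatment effect on remission persists for the whole horizon.

```python
def discounted_qalys(p_remit, p_relapse, months,
                     u_depressed=0.55, u_remission=0.85, annual_rate=0.035):
    """Discounted QALYs per patient for a cohort starting in the depressed
    state, using monthly cycles. All default values are hypothetical."""
    in_remission = 0.0  # proportion of the cohort currently in remission
    total = 0.0
    for month in range(months):
        discount = (1 + annual_rate) ** (-(month / 12))
        utility = in_remission * u_remission + (1 - in_remission) * u_depressed
        total += discount * utility / 12
        # Update the proportion in remission for the next cycle.
        in_remission = in_remission * (1 - p_relapse) + (1 - in_remission) * p_remit
    return total


def icer_at_horizon(months, extra_cost=3000.0):
    """Incremental cost per QALY of a hypothetical intervention with a
    one-off extra cost and a higher monthly remission probability."""
    gain = (discounted_qalys(0.15, 0.05, months)
            - discounted_qalys(0.10, 0.05, months))
    return extra_cost / gain


for horizon in (12, 36, 60):
    print(f"{horizon:>3} months: ICER = {icer_at_horizon(horizon):,.0f} per QALY")
```

With these inputs the incremental cost per QALY falls as the horizon lengthens, because the up-front cost is spread over more accrued benefit. If the effect instead waned, or relapse rates in the intervention arm rose after follow-up, extrapolation could move the result in the opposite direction, which is the trade-off described above.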

Analytical framework
Most evaluations included in this study used a CUA design, typically estimating incremental QALY changes associated with each alternative, with only five (mostly older evaluations) using only a CEA or CCA design. While the CEA approach has advantages (the results can be more intuitive for decision makers, and uncertainty is reduced since conversion of outcome measures to utility scores is not required), the results are of lesser value than those of a CUA for informing resource allocation decisions. Firstly, there is no immediately obvious decision rule: at what threshold of cost should a depression-free day be considered cost effective, for example? Perhaps more important, though, is the facility enabled by CUA to evaluate the cost effectiveness of an intervention within the whole healthcare sector. Mental healthcare provision is underfunded globally (107), and budgets for provision of mental healthcare are typically not ring-fenced, but must compete with other healthcare priorities. To justify support for novel interventions, commissioners must be able to appraise the value of those interventions within the context of these competing priorities (e.g., mental health vs cardiovascular disease).
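The decision rule available to a CUA can be made concrete with a short sketch. The function below classifies an intervention from its incremental cost and incremental QALYs, using the standard definitions of dominance, the incremental cost-effectiveness ratio (ICER), and net monetary benefit (NMB = threshold × incremental QALYs − incremental cost). The function name, the example inputs, and the thresholds are hypothetical; actual thresholds vary by jurisdiction.

```python
def incremental_analysis(delta_cost, delta_qaly, threshold):
    """Classify an intervention relative to its comparator.

    delta_cost: incremental cost (intervention minus comparator)
    delta_qaly: incremental QALYs (intervention minus comparator)
    threshold:  willingness to pay per QALY (hypothetical value)
    """
    nmb = threshold * delta_qaly - delta_cost
    if delta_cost < 0 and delta_qaly > 0:
        # More effective and less costly: no ratio is needed.
        return {"verdict": "dominant", "icer": None, "nmb": nmb}
    if delta_cost > 0 and delta_qaly < 0:
        # Less effective and more costly.
        return {"verdict": "dominated", "icer": None, "nmb": nmb}
    icer = delta_cost / delta_qaly if delta_qaly != 0 else None
    verdict = "cost-effective" if nmb > 0 else "not cost-effective"
    return {"verdict": verdict, "icer": icer, "nmb": nmb}


# Hypothetical examples.
print(incremental_analysis(delta_cost=-500, delta_qaly=0.05, threshold=20000))
print(incremental_analysis(delta_cost=4000, delta_qaly=0.10, threshold=30000))
print(incremental_analysis(delta_cost=2000, delta_qaly=0.10, threshold=30000))
```

No equivalent rule exists for a natural-unit outcome such as cost per depression-free day unless a threshold for that unit has been agreed, which is the limitation of CEA noted above.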

Summary measures of benefit
The most common economic outcome measure was the QALY, in most cases estimated using the EQ-5D-3L measure. Model-based evaluations predominantly used low-quality evidence to inform this parameter: sources were typically outdated, used unsophisticated valuation methods, and were usually drawn from the broader MDD population rather than being specific to TRD. There is good evidence that an increased number of treatment failures within an episode is associated with both increased depression severity and decreased HRQoL (8). This would indicate that HRQoL in TRD follows a somewhat distinct profile from the broader MDD population, and highlights the importance of using values specific to the population under study. Generic preference-based HRQoL measures are increasingly deployed in interventional evaluations (including eight described in this review): synthesis of contemporary data specific to the TRD population should therefore be considered for future economic evaluations.
Generic measures are typically recommended over condition-specific measures, since they facilitate comparable outcome collection across the healthcare spectrum, and (due to their brevity) are easy to collect (93). Despite their widespread use, however, there is a growing consensus amongst health economists working in mental health that generic measures such as the EQ-5D are not sufficiently sensitive to capture important changes in symptoms, functioning, or wellbeing in mental health conditions (108). While there is evidence that these issues may be valid in depression, concordance between generic HRQoL measures and clinical measures has been shown to reduce with severity (109). Partly in response to these concerns, there has been increased focus on measurement of wellbeing and quality of life in mental health (110), but to date, there exists no mental health domain-specific preference-based measure that has been sufficiently validated that it can be recommended as an alternative to the EQ-5D or the SF-6D. In the absence of such a measure, the quality of the evidence used to inform EQ-5D generated utility data is of particular importance, and extensive sensitivity testing of utility values is imperative. It should be noted that increasingly, the updated EQ-5D-5L (rather than the -3L) measure is used in interventional studies, owing to its superior psychometric properties (111). The value of supplementing a CUA with a secondary CEA or CCA analysis (for example incorporating mental-health specific outcomes, or patient preferences), in order to increase confidence in results, may additionally be considered.
Where a CEA approach was adopted, various outcomes were used (cost per remitter, cost per depression-free day, cost per relapse prevented, or simply incremental change in outcome). Cost per remission is arguably a more intuitive measure to present to decision makers, and conversion of the cost per unit change to this measure should be relatively straightforward, providing adequate availability of information.

Patient preference and priorities
Recent years have seen increasing interest in the adoption of a "values-based" framework for delivery of mental health care, explicitly incorporating the preferences, priorities, and values of mental health service users (112,113). The incorporation of patient preferences in decisions related to resource allocation is justifiable on grounds of both ethics (since patients have agency in the decisions that affect their health), and on improving outcomes (patients are more likely to engage with interventions that match their preferences) (114). Despite this, none of the evaluations described incorporated patient values, preferences, or priorities in the presentation of their analysis. The HTA report by Atlas et al. (53) incorporated feedback from patient advocates, importantly highlighting concerns that the clinical outcome measures typically used do not reflect the full burden of TRD, and calling for the incorporation of measures of impact on work, productivity, disability, and family or caregiver wellbeing. Elsewhere, patients have argued that remission is more accurately described by the presence of positive mental health features (optimism, vigor, and self-confidence) than the absence of symptoms (115). Although currently not a pre-requisite for HTA submissions, or best practice guidance, the growing recognition of the importance of the perspective of the patient in resource allocation decisions warrants serious consideration of how this might be incorporated explicitly in future economic evaluations. Longer-term objectives might consider the co-development of outcome measures that better reflect patient priorities; more immediately, methods such as discrete-choice experiments may be used to directly elicit and value both health and non-health impacts of interventions, facilitating direct incorporation of patient preferences in economic evaluations (116).

Reporting of resource use and cost data
Resource use in economic evaluation is highly context-specific: owing to the breadth of interventions, jurisdictions and cost perspectives considered by the evaluations in this study, a granular critical evaluation and comparison of resource use is unlikely to be informative. Focusing instead on broader resource item considerations, we found a reasonable level of consistency for direct costs across the evaluations. A third of the evaluations reviewed included indirect non-healthcare costs, although with considerable variation in the items included. In many cases this simply included productivity gains or losses which, when measured over relatively short time horizons, had a relatively small impact on results compared to the healthcare perspective. A minority considered a more comprehensive set of indirect costs. Variability in indirect costs that contribute to the broader "societal" perspectives is in part a reflection of the different contexts in which these evaluations were conducted: out-of-pocket costs, reliance on informal care, or transport costs may vary significantly between jurisdictions and in some cases may be so negligible that they are not considered for inclusion.
Good practice guidance relating to selection of costs for inclusion in economic evaluations recommends that either all relevant costs should be included, or (for more pragmatic studies) those costs that are most likely to meaningfully differ between comparators and thereby impact the result of the evaluation (47).

Perspective
The choice of cost perspective should be informed by the intended audience of the economic evaluation (47). Most commonly, the audience for economic evaluations is the payer; in the UK, NICE (whose remit is to determine if interventions should be funded by the NHS), requires that the perspective for economic evaluations should be that of the health service (104). Effective management of depression though, has been shown to have significantly greater impacts on productivity costs alone than on health care costs (21). When considering the global costs to society of poor mental health, choosing a narrow perspective that disregards those costs (or benefits) may be problematic, or even misleading.
Since mental health care is typically funded through public health care budgets, a health system perspective will be a pre-requisite for most decision makers, but we would reiterate the call from Knapp and Wong (21) that by providing a societal perspective in parallel, the broader societal impacts can also be taken into account. This broader perspective, however, is somewhat juxtaposed with our earlier recommendation that the primary analysis should use a CUA design. An immediate approach therefore might consider a secondary CCA analysis, adopting a societal perspective and reporting the non-health costs and benefits of alternatives.

Conceptualization and validation of model-based evaluations
None of the evaluations reviewed explicitly reported a formal conceptualization process, few presented a rationale for choice of model or model structure, and very few reported any robust validation of the model. The key health states described in most of the evaluations were consistent with established treatment goals of trials in MDD/TRD, including response, remission, and relapse (117). Sensitivity analyses of model-based evaluations frequently showed that it was these outcome parameters that were most likely to affect the results of the evaluations. Beyond these key endpoints, there was considerable variation in the structural complexity of model-based evaluations. Adverse events were rarely considered explicitly, although a minority of evaluations indicated that they had been considered and dismissed as having a negligible impact. Similarly, discontinuation was rarely considered, and where it was, the reasons for discontinuation were poorly described.
Good practice guidance recommends an explicit process of conceptual modeling prior to implementation, to arrive at an appropriate scope for perspective, time horizon, choice of model type and structure, and which outcomes and costs to consider (118). The requirement to explicitly detail model conceptualization in reports has recently been added to the NICE HTA manual section 4.6.3 (104).

Limitations
This review restricted search criteria to English language only evaluations; by excluding foreign language records, our review may have limited consideration of aspects of economic evaluation that are prioritized differently in non-English speaking jurisdictions.
The review was deliberately designed with a "broad-brush" approach. Our aim was to develop a resource to inform the design of future economic evaluations in TRD agnostic of intervention, setting, or perspective. The review consequently incorporated all intervention types and all study design types; however, this introduces heterogeneity into the review, and limits the detail with which differences between evaluations may be explored. In keeping with the broad-brush approach, evaluation appraisal and recommendations are necessarily made at a generic level, and are not specific to context. Comparative evaluation of the results of included studies was conducted at a superficial level to illustrate how different evaluation design considerations may influence study conclusions. Where comparison of results is undertaken to inform resource allocation decisions, it is critical that context is accounted for. Key factors that should be considered in further detail in such comparisons include severity; number of previously failed treatments; treatment setting; and jurisdictional variations in resource costs and cost-effectiveness thresholds.

Conclusion
Consistent with reviews of economic evaluations in MDD (23), our review found that the economic evidence for interventions in TRD is underdeveloped, particularly so for service-level interventions. Where evidence does exist, it is hampered by inconsistency in study design, methodological quality, and availability of high quality long-term outcomes evidence. Consequently there is limited data available to reassure policy makers involved in commissioning interventions and services in TRD of their cost effectiveness.
To strengthen the evidence base, this review identifies a number of key considerations and challenges for the design of future economic evaluations. While some considerations may be addressed immediately (e.g., appropriately defining the evaluation population, and selection of appropriate time-horizon and perspective), we also identify longer term challenges related to methodology development and building consensus in the research community to promote consistency in study design. The lack of long-term outcomes data limits the value of current economic evaluations. In particular we identified a need for more robust health-state utility data specific to TRD; consensus for a core outcome set that incorporates the measures from which these are derived would be a significant step forward.
Reflecting the growing recognition of the importance of incorporating the values of the patient in resource allocation decisions, we also suggest there is a need to develop methods to incorporate those values in economic evaluation frameworks systematically.

Data availability statement
The original contributions presented in this study are included in this article/Supplementary material; further inquiries can be directed to the corresponding author.

Author contributions
LJ and CW designed the research programme this review belongs to. LH, JP, and RAC developed the protocol and study design. LH and RAC conducted the literature search, screening of reports, data extraction, analysis, and drafted the manuscript. RNC contributed to the interpretation of literature review and critical revision of the manuscript. All authors reviewed and contributed to the final draft of this manuscript. RNC's research was supported by the UK Medical Research Council (MR/W014386/1). The funders had no role in the design, review, interpretation, writing of the report, or the decision to submit it for publication.